library(archive)
library(dplyr)
library(plotly)
library(knitr)
COMPUTE_GRID_SEARCH <- FALSE
Let’s begin by discussing the first step of our code, which handles
extracting the image data stored in a compressed archive. We start by
specifying the file name "Training.rar", which contains all
of our images. To extract the contents, we use the
archive_extract() function from the archive package. We
wrap this call in a try() block with the parameter
silent = TRUE. This is a safeguard: if an error occurs
during extraction (for example, because the archive has already been
extracted), we can smoothly continue with the rest of the
processing.
# Specify the file path
rar_file <- "Training.rar"
# Extract the contents
#unpacked_files <- archive::archive_extract(rar_file)
try(archive_extract("Training.rar"), silent = TRUE)
Following the extraction, we define two helper functions to manage
and organize our image files. The first function,
extract_id, takes a filename and uses a regular expression
to extract the first sequence of digits from it. We then convert that
sequence into a numeric value, which serves as a unique identifier for
each image (this identifier typically corresponds to a subject’s ID).
The second function, get_sorted_filenames, makes use of
list.files() to retrieve all JPEG files in the “Training”
folder. We then apply our extract_id function to each
filename to obtain their numeric IDs, sort the filenames based on these
IDs, and return the sorted list.
extract_id <- function(filename) {
matches <- regmatches(basename(filename), regexpr("[0-9]+", basename(filename)))
return(as.numeric(matches)) # Convert to numeric
}
get_sorted_filenames <- function(){
# Get all .jpg files
image_files <- list.files("Training", pattern = "\\.jpg$", full.names = TRUE)
sorted_indices <- order(sapply(image_files, extract_id))
image_files <- image_files[sorted_indices]
return(image_files)
}
Once the filenames are sorted, we proceed to load and process the
images. We store the sorted list of image file paths in the variable
image_files and extract the numeric IDs again for potential
labeling or further processing. Next, we iterate over each file, loading
the image using OpenImageR::readImage(). Since our images
are in color, we need to handle each of the three channels—red, green,
and blue—separately.
For each channel, the two-dimensional array that represents the pixel
intensities is flattened into a one-dimensional vector using
as.vector(). We then concatenate these three vectors into a
single vector, effectively transforming each image into a long vector
that contains all the pixel values sequentially. Mathematically, if an
image is represented by a 3D array \[
I \in \mathbb{R}^{H \times W \times 3},
\] then the red, green, and blue channels are flattened into
vectors \[
r, g, b \in \mathbb{R}^{HW},
\] respectively, and concatenated as: \[
\mathbf{x} = \begin{bmatrix} r \\ g \\ b \end{bmatrix} \in
\mathbb{R}^{3HW}.
\]
Each of these vectors is stored in a list, using the file name as the key. This organized storage allows us to later combine all the image vectors into a single matrix for further analysis.
# Get all .jpg files in the Training folder
image_files <- get_sorted_filenames()
#Remove the files
#image_files <- image_files[!image_files %in% c("Training/1AT.jpg", "Training/3BT.jpg")]
image_ids <- sapply(image_files, extract_id)
image_rows <- list()
for (file in image_files) {
img <- OpenImageR::readImage(file) # Load image
# Flatten RGB channels and concatenate into a single row
red <- as.vector(img[,,1]) # Flatten red channel
green <- as.vector(img[,,2]) # Flatten green channel
blue <- as.vector(img[,,3]) # Flatten blue channel
# Combine R, G, B into one row and store it
image_rows[[file]] <- c(red, green, blue)
}
# Convert the list of rows into a matrix
M <- do.call(rbind, image_rows)
# Check dimensions (rows = number of images, columns = total pixels * 3)
dim(M)
## [1] 150 108000
The function compute_pca begins by calculating the mean
vector of the input data matrix, where each column corresponds to a
feature (in our case, pixel values). We use the colMeans
function to compute this mean and store it in the variable
mean_m. Immediately afterward, we center the data by
subtracting this mean from every observation. We achieve this centering
using the sweep function, which subtracts the mean from
each column of the data matrix, ensuring that the new data matrix has a
zero mean along every dimension.
Once the data is centered, we compute a covariance-like matrix. However, rather than calculating the full \(d \times d\) covariance matrix (with \(d\) being the number of pixels), we compute a smaller \(n \times n\) matrix, where \(n\) is the number of observations. We do this by multiplying the centered data matrix by its transpose and scaling the result by \(\frac{1}{n-1}\). This yields
\[ C = \frac{1}{n-1} X_c X_c^\top, \]
where \(X_c\) represents our centered data matrix. This smaller covariance matrix is computationally more efficient to work with when the dimensionality is very high.
Next, we perform an eigen-decomposition on this smaller covariance
matrix using the eigen function, which returns a set of
eigenvalues and eigenvectors. Because these eigenvectors are computed in
the space of the \(n \times n\) matrix,
they do not directly represent the directions in the original
high-dimensional pixel space. To map them back, the function multiplies
the transpose of the centered data by these eigenvectors. This operation
effectively transfers the principal component directions back into the
original space. After that, we normalize each column (each representing
an eigenvector in pixel space) by dividing it by its Euclidean norm,
ensuring that every principal component has unit length.
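The reason this back-mapping works is a one-line identity: if \(v\) is an eigenvector of the small matrix, \[
\frac{1}{n-1} X_c X_c^\top v = \lambda v,
\] then left-multiplying both sides by \(X_c^\top\) gives \[
\frac{1}{n-1} X_c^\top X_c \left(X_c^\top v\right) = \lambda \left(X_c^\top v\right),
\] so \(X_c^\top v\) is an (unnormalized) eigenvector of the full \(d \times d\) covariance matrix with the same eigenvalue.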
Finally, we compute the proportion of variance explained by each
principal component. This is done by dividing each eigenvalue by the sum
of all eigenvalues, which gives us a normalized vector of explained
variances. The function then returns a list containing the mean vector
(mean_obs), the matrix of normalized eigenvectors
(P), the vector of explained variance proportions
(D), and the raw eigenvalues (E).
compute_pca <- function(data) {
# Compute mean
mean_m <- colMeans(data)
# Subtract the mean from each row
centered_matrix <- sweep(data, 2, mean_m, FUN = "-")
n <- nrow(centered_matrix)
var_cov_small <- (1 / (n - 1)) * (centered_matrix %*% t(centered_matrix))
eigen_dec <- eigen(var_cov_small)
P <- t(centered_matrix) %*% eigen_dec$vectors
P <- apply(P, 2, function(p) p / sqrt(sum(p^2)))
D <- eigen_dec$values / sum(eigen_dec$values)
return(list(mean_obs = mean_m, P = P, D = D, E = eigen_dec$values))
}
pca_results <- compute_pca(M)
mean_obs <- pca_results$mean_obs
P <- pca_results$P #eigenvectors in pixel space normalized #image space
D <- pca_results$D #variance explained by eigenvalue
E <- pca_results$E #eigenvalues
#V <- pca_results$V #eigenvectors in pixel space normalized
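As an optional sanity check (a minimal sketch on synthetic data; X_small, Y_ours, and Y_ref are illustrative names), the projections produced by compute_pca should match those of R's built-in prcomp up to the arbitrary sign of each eigenvector:
# Sketch: compare our PCA against prcomp on a small random matrix
set.seed(42)
X_small <- matrix(rnorm(20 * 50), nrow = 20) # 20 observations, 50 features
ours <- compute_pca(X_small)
Y_ours <- sweep(X_small, 2, ours$mean_obs, FUN = "-") %*% ours$P
Y_ref <- prcomp(X_small, center = TRUE, scale. = FALSE)$x
# Eigenvector signs are arbitrary, so compare absolute values
max(abs(abs(Y_ours[, 1:5]) - abs(Y_ref[, 1:5]))) # should be near zero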
The projection of the images into the PCA space is obtained as follows:
Y <- sweep(M, 2, mean_obs, FUN = "-") %*% P
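Before moving on, we can check how quickly the explained variance accumulates (a quick sketch using the vector D returned by compute_pca; cumvar is an illustrative name):
# Sketch: cumulative proportion of variance explained by the leading components
cumvar <- cumsum(D)
round(head(cumvar, 10), 3) # share of total variance captured by the first 10 PCs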
We can visualize the contribution of the pc-th principal
component to the image at position image_index as follows.
For example, we can look at the projection of the first image in the
dataset onto the first principal component.
The function visualize_pc lets us see how a specific
principal component contributes to one of our images. Essentially, we
extract the desired principal component (in our case, the first one)
from the matrix \(P\), which represents
the direction capturing a significant portion of the variance in our
data. We then take the corresponding coordinate for our selected image
from the projection matrix \(Y\) (this
tells us how much of that principal component is present in the
image).
To visualize this, we reconstruct the image by scaling the principal
component by that coordinate and then adding back the overall mean image
vector. Mathematically, we compute: \[
x_{\mathrm{pc}_1} = y_1 \, p_1 + \mu,
\] where \(p_1\) is the chosen
principal component, \(y_1\) is the
image’s coordinate along this component, and \(\mu\) is the mean image
vector (mean_obs in the code). Once we have this reconstructed
vector, we reshape it into the original image dimensions (height, width,
and 3 color channels) to form an image matrix that we can display using
OpenImageR::imageShow().
visualize_pc <- function(image_index, pc, mean_obs, P, Y, img_height = 200, img_width = 180) {
# Extract the chosen principal component
p1 <- P[, pc] # Principal component direction (size p)
# Get the projection of the specific image onto that component
y1 <- Y[image_index, pc] # Coordinate in PCA space along the chosen component
# Reconstruct only this component's contribution
x_pc1 <- y1 * p1 + mean_obs
# Reshape into image dimensions
image_matrix <- array(x_pc1, dim = c(img_height, img_width, 3))
# Return the reconstructed image for display
return(image_matrix)
}
# Example: Visualizing the PC1 contribution for image x
image_matrix <- visualize_pc(1,1, mean_obs, P, Y)
OpenImageR::imageShow(image_matrix)
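As a small aside, the mean face itself can be displayed the same way, since mean_obs has length \(3HW = 108000\) and reshapes back into the original image dimensions:
# Sketch: reshape the mean vector into a 200 x 180 x 3 image and display it
mean_face <- array(mean_obs, dim = c(200, 180, 3))
OpenImageR::imageShow(mean_face)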
In this part, we take a closer look at how our PCA has organized the image data by visualizing the projections onto the first three principal components. By plotting all images in this 3D space, we can see that images belonging to the same person tend to group together, which is a strong indication that our PCA is capturing the underlying structure of the data effectively.
To achieve this, we define the function
plot_pca_3d_interactive. In this function, we create a data
frame that includes the first three principal components (PC1, PC2, and
PC3) and assign a color based on each image’s ID. This allows us to
visually differentiate the images by subject. We then use the
plot_ly function from the Plotly library to generate an
interactive 3D scatter plot. Each point in the plot represents an image,
and hovering over a point displays the corresponding ID. Finally, we
call this function with our projection matrix Y and the
image_ids to display the visualization.
plot_pca_3d_interactive <- function(Y, image_ids) {
df <- data.frame(PC1 = Y[,1], PC2 = Y[,2], PC3 = Y[,3], ID = factor(image_ids))
plot_ly(df, x = ~PC1, y = ~PC2, z = ~PC3, color = ~ID, text = ~paste("ID:", ID),
type = "scatter3d", mode = "markers") %>%
layout(title = "PCA Projection (PC1 vs PC2 vs PC3)")
}
plot_pca_3d_interactive(Y, image_ids)
The function fda is designed to perform Fisher
Discriminant Analysis (FDA) on a dataset, where the input matrix \(Y\) has dimensions \(n \times p\) and its last column represents
the class labels. We begin by extracting the class labels from the last
column of \(Y\) and converting them
into a factor so that we can properly identify the different groups. The
remaining columns of \(Y\) form the
feature matrix \(X\), which contains
the observations that we will use for the FDA.
Once we have isolated \(X\), we calculate the overall mean vector \(m\) of the features using the column means of \(X\). This mean vector represents the central tendency of the entire dataset in the feature space. Next, we determine the unique classes present in the class vector and initialize two key scatter matrices: the within-class scatter matrix \(S_W\) and the between-class scatter matrix \(S_B\). Both matrices are initialized as zero matrices of size \(p \times p\), where \(p\) is the number of features.
For each unique class, we extract the submatrix \(X_{\text{class}}\) containing only the
observations that belong to that class and compute the class mean
(denoted as \(\mu_{\text{class}}\)) for
those observations. We then update the between-class scatter matrix
\(S_B\) by adding the weighted outer
product of the difference between the class mean and the overall mean.
Mathematically, for each class with \(n_i\) observations, the contribution is
\[
n_i (\mu_{\text{class}} - m)(\mu_{\text{class}} - m)^T.
\]
At the same time, we compute the within-class scatter matrix \(S_W\) by iterating over each observation in
the class. For each observation \(x_j\)
in \(X_{\text{class}}\), we calculate
the difference \(x_j -
\mu_{\text{class}}\) and add the outer product of this difference
with itself to \(S_W\), that is, \[
(x_j - \mu_{\text{class}})(x_j - \mu_{\text{class}})^T.
\]
After processing all the classes, we perform an eigen-decomposition
on the matrix product \(S_W^{-1} S_B\).
This is done by inverting \(S_W\) using
the solve function and then multiplying it by \(S_B\). The eigenvalues and eigenvectors
resulting from this operation indicate the directions in the feature
space that maximize the ratio of between-class scatter to within-class
scatter. We then sort the eigenvalues in decreasing order and reorder
the eigenvectors accordingly so that the most discriminative directions
appear first. Any imaginary parts are discarded; they arise from
numerical error in the decomposition and are typically negligibly small.
Finally, the function returns a list containing the overall mean vector \(m\), the sorted eigenvectors (denoted as \(P\)), and the eigenvalues (denoted as \(D\)). These eigenvectors represent the FDA directions that optimally separate the classes, while the eigenvalues reflect the discriminative power along those directions.
fda <- function(Y) {
# Y is an n x p matrix obtained from a PCA where the last column has to contain the class.
# We extract the class column and convert it to a factor
class_vector <- Y[, ncol(Y)]
class_vector <- as.factor(class_vector)
# We extract the feature matrix (all columns except the last one)
X <- Y[, -ncol(Y), drop = FALSE]
# We compute the overall mean of the features
m <- colMeans(X)
# Determine the number of classes and the unique classes
people <- unique(class_vector)
# We obtain the number of features (dimensions) to select the size of S_W and S_B
p <- ncol(X)
# We initialize the within-class scatter matrix (S_W) and the between-class scatter matrix (S_B)
S_W <- matrix(0, nrow = p, ncol = p)
S_B <- matrix(0, nrow = p, ncol = p)
# For each class, we compute the class mean and we accumulate contributions to S_B and S_W
for (cl in people) {
# Submatrix of observations for the current class 'cl'
X_class <- X[class_vector == cl, , drop = FALSE]
n_i <- nrow(X_class)
# We compute the mean of the current class
mean_class <- colMeans(X_class)
# We update S_B for each class
# We perform the outer product of (mean_class - m), weighted by n_i
S_B <- S_B + n_i * ((mean_class - m) %*% t((mean_class - m)))
# To Update S_W we need to access every observation j of the class one by one
# We create a for loop which goes through each observation to perform the
# sum of (x_j - mean_class)(x_j - mean_class)^T
for (j in 1:n_i) {
diff_j <- as.matrix(X_class[j, ] - mean_class)
S_W <- S_W + diff_j %*% t(diff_j)
}
}
# We compute the eigenvalues and eigenvectors for the matrix solve(S_W) %*% S_B
eig_values <- eigen(solve(S_W) %*% S_B)
# We obtain the sorted indices of the eigenvalues in decreasing order
sort_eig <- order(eig_values$values, decreasing = TRUE)
# We sort the eigenvectors and eigenvalues according to these indices
eigenvectors <- Re(eig_values$vectors[, sort_eig])
eigenvalues <- Re(eig_values$values[sort_eig]) # Ensure real values
D <- eigenvalues / sum(eigenvalues) # Proportion of discriminative power per direction
return(list(mean_fda = m, P_fda = eigenvectors, D_fda = D, E_fda = eigenvalues))
}
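As a brief illustration (a sketch on the built-in iris data rather than our face images; iris_input and proj1 are illustrative names), we can feed fda a matrix whose last column holds the class and check that the first discriminant direction separates the classes:
# Sketch: sanity-check fda() on iris (4 features, 3 classes)
iris_input <- cbind(as.matrix(iris[, 1:4]), class = as.numeric(iris$Species))
fda_iris <- fda(iris_input)
# Project onto the first discriminant direction and inspect the class means
proj1 <- drop(as.matrix(iris[, 1:4]) %*% fda_iris$P_fda[, 1])
tapply(proj1, iris$Species, mean) # the three means should be well separated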
The face recognition will be performed using a KNN algorithm. Given a new image of a person, the system should recognize who the person is if they appear in the dataset, or classify them as unknown otherwise. The building blocks of this algorithm are described below.
As our KNN classifier relies on computing distances between vectors,
we decided to try three different metrics that showed promising results
in the paper “Distance measures for PCA-based face recognition.” We
implemented these metrics (“mahalanobis”,
“weighted_angle”, and “modified_sse”) to see
which one works best for our application.
The function compute_distance is designed to calculate
the distance between two vectors, \(x\)
and \(y\), using one of the three
metrics. We choose the metric via the parameter metric, and
the function also takes a vector of eigenvalues \(E\) that is used to compute the weights in
some of these distance measures.
For the "mahalanobis" metric, we first set a constant
\(\alpha = 0.25\). The function then
computes a weighting vector \(z\) as
follows:
\[ z = \sqrt{\max\left(0, \frac{E}{E + \alpha^2}\right)}, \]
After calculating \(z\), the function returns the negative sum of the element-wise product of \(z\), \(x\), and \(y\). The negative sum is used here so that, for classification, a lower value corresponds to a closer match.
When we choose the "weighted_angle" metric, the function
computes a different weighting vector by taking:
\[ z = \sqrt{\max\left(0, \frac{1}{E}\right)}, \]
It then returns the negative weighted inner product of \(x\) and \(y\), normalized by the product of their Euclidean norms. In mathematical terms, this distance is calculated as:
\[ \text{weighted angle} = -\frac{\sum_i z_i \, x_i \, y_i}{\sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2}}, \]
which is similar to a cosine similarity measure with weights; the negative sign again ensures that a higher similarity results in a lower distance.
Lastly, for the "modified_sse" metric, the function
calculates a normalized squared error. It computes the sum of the
squared differences between the corresponding elements of \(x\) and \(y\), and then divides that sum by the
product of the sums of squares of \(x\)
and \(y\):
\[ \text{modified sse} = \frac{\sum_i (x_i - y_i)^2}{\left(\sum_i x_i^2\right)\left(\sum_i y_i^2\right)}. \]
# Computes the distance between 2 vectors. D is the eigenvalue list, required for some metrics.
compute_distance <- function(x, y, E, metric) {
if (metric == "mahalanobis") {
alpha <- 0.25
z <- sqrt(pmax(0, E / (E + alpha^2)))
return(-sum(z * x * y))
} else if (metric == "weighted_angle") {
z <- sqrt(pmax(0, 1 / E))
return(-sum(z * x * y) / sqrt(sum(x^2) * sum(y^2)))
} else if (metric == "modified_sse") {
return(sum((x - y)^2) / (sum(x^2) * sum(y^2)))
} else {
stop("Invalid metric. Choose from 'mahalanobis', 'weighted_angle', or 'modified_sse'.")
}
}
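A toy example (with made-up three-dimensional vectors and hypothetical eigenvalues E_demo) illustrates the sign convention: more negative means more similar for the similarity-based metrics:
# Toy example: identical vectors score lower (more similar) than orthogonal ones
E_demo <- c(4, 2, 1) # hypothetical eigenvalues
compute_distance(c(1, 0, 0), c(1, 0, 0), E_demo, "weighted_angle") # -0.5
compute_distance(c(1, 0, 0), c(0, 1, 0), E_demo, "weighted_angle") # 0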
The function knn_classifier implements a k-nearest
neighbors (KNN) classifier. It takes several parameters: the number of
neighbors \(k\), a similarity
threshold, the new image’s vector \(y_{\text{new}}\), a matrix \(Y\) containing the representations of the
training images (one per row), the corresponding image IDs, a string
metric indicating which distance metric to use, and a
vector \(E\) of eigenvalues required by
some metrics.
The function first computes the distances between the new image \(y_{\text{new}}\) and every row of \(Y\). This is achieved by applying the
compute_distance function to each row, which calculates the
similarity according to the selected metric. The resulting distances are
stored in a vector, and a data frame is created with two columns: one
for the image IDs and another for the computed distances.
After obtaining the distance values, the data frame is sorted in ascending order by distance. The assumption here is that the closer two images are in the space, the greater their similarity. The function then selects the \(k\) nearest neighbors (i.e., the \(k\) rows with the smallest distances). Instead of computing an average distance, which might be misleading if the metric produces both positive and negative values, the function uses the minimum distance among these \(k\) neighbors as the decisive measure. This minimum distance is compared against a pre-set threshold: if the smallest distance is greater than the threshold, the function returns \(0\), signifying that the new image is too dissimilar from any training image to be confidently recognized. This relies on the assumption that if the closest image is already too far away for the new image to be considered part of the dataset, the remaining images are even farther.
If the minimum distance is within the acceptable threshold, the function then examines the class IDs of the \(k\) nearest neighbors. It counts the occurrences of each ID and returns the majority class (i.e., the ID that appears most frequently). This classification decision is based on the intuition that if the new image is similar to several images from a particular class, it likely belongs to that same class.
In summary, the decision rule can be viewed as: if \[ \min \{d(y_{\text{new}}, y_i) : y_i \in \text{k nearest neighbors}\} > \text{threshold}, \] then return \(0\), otherwise classify \(y_{\text{new}}\) according to the majority label among its \(k\) nearest neighbors.
knn_classifier <- function(k, threshold, y_new, Y, image_ids, metric, E){
distances <- apply(Y, 1, function(y) compute_distance(y, y_new, E, metric) )
dist_df <- data.frame(ID = image_ids, Distance = distances)
# Sort the distance in ascending order
dist_df <- dist_df[order(dist_df$Distance), ]
# Take the first k distances
k_nearest <- head(dist_df, k)
# Average distance. If distance is larger than threshold return 0. Else continue
#distance <- mean(k_nearest$Distance) # Doesn't make sense in similarity metrics that can have negative values or ranges that go from negatives to positive. Averaging positive and negative values might cancel out the real distance.
# Min distance: If the closest/most similar, is too far, the rest will be farther.
distance <- min(k_nearest$Distance)
if (distance > threshold) {
return(0)
}
# value count of the ids. Return the majority class.
id_counts <- table(k_nearest$ID)
majority_id <- as.numeric(names(which.max(id_counts)))
return(majority_id)
}
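A quick usage sketch (assuming the PCA objects Y, E, and image_ids computed earlier; the threshold value here is a hypothetical placeholder, not a tuned one) classifies the first image against the remaining ones:
# Sketch: classify image 1 against the rest using the first 30 PCA components
pred_id <- knn_classifier(k = 5, threshold = 1e-4, # hypothetical threshold
y_new = Y[1, 1:30], Y = Y[-1, 1:30],
image_ids = image_ids[-1], metric = "modified_sse", E = E[1:30])
pred_id # ID of person 1 if another of their images is close enough, else 0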
We will perform a two-step grid search aiming to maximize the balanced accuracy. We could do it in one step, and the results would probably be better, but this approach greatly reduces the computation time.
The procedure is the following:

1. Find the best threshold for each metric, with the remaining parameters fixed, over n random train/test splits. In the test split there are always images of people not present in the train split.
2. Fix the thresholds found in step 1 and search for the best values of the remaining parameters over n random train/test splits. The best parameters will be those that are best on average.

The justification of this methodology relies on the randomness of the splits. A potential issue with optimizing a threshold on a single split is that it may overfit to the specific individuals present in that split, leading to poor generalization when applied to new data. By evaluating multiple train/test splits where the test set always contains individuals not seen in training, we mitigate this risk by averaging performance across diverse scenarios. This approach helps smooth out variability and ensures that the chosen threshold is not overly sensitive to a particular split but instead represents a robust choice across different partitions of the dataset. We apply the same methodology to the other parameters, reinforcing the generalizability of the final model selection.
Now we will describe the functions used for that.
The function create_random_splits_by_image is designed
to generate multiple random training and test splits based on image
identifiers, while ensuring that a specified number of unique subjects
(or classes) are completely excluded from the training set in each
split. The input parameter image_ids represents the class
labels associated with each image, while
percent_test, num_splits, and
num_people_to_exclude allow us to control the proportion
of test data, the number of splits, and the number of subjects whose
images are excluded from training and placed entirely in the test set,
respectively.
At the beginning of the function, a fixed random seed is set using
set.seed(1) to guarantee reproducibility of the random
splits. The function then extracts the unique person id labels from
image_ids and proceeds to create the splits using
lapply. Before splitting the data,
num_people_to_exclude will be randomly selected (their
ids), and all the images belonging to this group will be excluded from
the splitting process. Then, the splits are created from the set of
remaining images, and the ones excluded initially are included in the
test set. In mathematical terms, if \(T\) represents the set of test indices and
\(E\) represents the excluded indices,
then the final test set is \(T \cup
E\), and the training set is defined as the complement of \(T \cup E\) within the full set of image
indices.
The function also provides diagnostic output by printing the excluded subjects along with the sizes of the test and training splits. It then checks for any classes that appear in the test set but not in the training set by comparing the unique class labels present in each set. If such cases are found, it prints a message indicating which subjects are missing from the training set and returns their indices as part of the split information.
In the end, the function returns a list of splits, where each split is a list containing the training indices, the test indices, and any indices corresponding to missing people (people in the test set who are not present in the train set; this is kept purely for diagnostics). This approach is particularly useful to ensure that some subjects are completely unseen during training, which helps evaluate the generalization and robustness of the classifier.
create_random_splits_by_image <- function(image_ids, percent_test = 0.3, num_splits = 5, num_people_to_exclude = 1) {
set.seed(1)
unique_people <- unique(image_ids) # Get unique class labels
splits <- lapply(1:num_splits, function(i) {
# Randomly select people to exclude from train
excluded_people <- sample(unique_people, num_people_to_exclude)
# Get indices of images belonging to excluded people
excluded_indices <- which(image_ids %in% excluded_people)
# Sample test images from the remaining people
remaining_images <- setdiff(names(image_ids), names(image_ids)[excluded_indices]) # Remove excluded images
test_images <- sample(remaining_images, round(length(remaining_images) * percent_test))
# Get test and train indices
test_indices <- which(names(image_ids) %in% test_images) # Normal test split
test_indices <- unique(c(test_indices, excluded_indices)) # Force excluded images into the test set
train_indices <- setdiff(seq_along(image_ids), test_indices) # Remaining are train
cat("\nExcluded people: ", paste(excluded_people, collapse = ", "), "\n")
cat("length test: ", length(test_indices), "\nlength train: ", length(train_indices), "\n")
# Extract class labels for train and test sets
train_people <- unique(image_ids[train_indices])
test_people <- unique(image_ids[test_indices])
# Check if any class is in test but not in train
missing_people <- setdiff(test_people, train_people)
missing_people_indices <- which(image_ids %in% missing_people)
if (length(missing_people) > 0) {
cat("The following people are in the test set but not in the train set: ", paste(missing_people, collapse = ", "), '\n')
}
return(list(train = train_indices, test = test_indices, missing_people = missing_people_indices))
})
return(splits)
}
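A minimal call (a sketch: three splits, 30% test, one fully held-out person; splits_demo is an illustrative name) looks like this:
# Sketch: generate three random splits and inspect the structure of the first
splits_demo <- create_random_splits_by_image(image_ids, percent_test = 0.3,
num_splits = 3, num_people_to_exclude = 1)
str(splits_demo[[1]], max.level = 1) # train, test and missing_people indices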
The following functions will be used to find the best parameters (explained later).
run_knn_classification is a helper function that factors
out the part of the grid search shared between the KNN face recognizer
that uses PCA projections and the one that does not. It iterates over a
grid of parameters (metric, k, threshold) and computes the known and
unknown accuracies: the accuracy on people from the test set who also
appear in the train set, and the accuracy on people from the test set
who are not present in the train set, respectively. It then computes
the balanced accuracy, which is the mean of the two.
run_knn_classification <- function(Y_train, Y_test, image_ids_train, image_ids_test,
metrics, param_grid, fold, n_pc, E_pc, FDA_dims = NA, results) {
for (metric in metrics) {
if (!(metric %in% names(param_grid))) {
stop(paste("Unknown metric:", metric))
}
for (k in param_grid$k_values) {
for (threshold in param_grid[[metric]]) {
cat("Testing Metric:", metric, "| k:", k, "| n_pc:", n_pc, "| fda_dims:", FDA_dims, "| threshold:", threshold, "\n")
predictions <- sapply(1:nrow(Y_test), function(i) {
knn_classifier(k = k, threshold = threshold, y_new = Y_test[i, ],
Y = Y_train, image_ids = image_ids_train, metric = metric, E = E_pc)
})
known_mask <- image_ids_test %in% image_ids_train
unknown_mask <- !image_ids_test %in% image_ids_train
known_accuracy <- ifelse(sum(known_mask) > 0, mean(predictions[known_mask] == image_ids_test[known_mask]), NA)
unknown_accuracy <- ifelse(sum(unknown_mask) > 0, mean(predictions[unknown_mask] == 0), NA)
balanced_accuracy <- mean(c(known_accuracy, unknown_accuracy), na.rm = TRUE)
results <- rbind(results, data.frame(Fold = fold, n_pcs = n_pc,
Metric = metric, k = k, Threshold = threshold,
Known_Accuracy = known_accuracy, Unknown_Accuracy = unknown_accuracy,
Balanced_Accuracy = balanced_accuracy, FDA_dims = FDA_dims))
}
}
}
return(results)
}
Our function find_best_params plays a pivotal role in
our grid search for optimal classifier parameters by separating the
parts of the search process that differ between classifiers that use PCA
and those that do not. This function receives a boolean parameter
compute_pca which, if TRUE, instructs us to compute PCA
using only the training images, and then to project all the images into
the PCA space. We then iterate over the number of principal components
specified in the parameter grid and call
run_knn_classification for each set of components to
continue with the grid search and obtain classification results.
Moreover, if the boolean parameter apply_fda is set to
TRUE, after computing the PCA projection we further apply FDA on the PCA
matrix by appending the class labels and computing FDA on these
principal components. It is important to note that when
apply_fda is TRUE, only the modified_sse
metric is used because employing either the eigenvalues from the FDA or
the PCA does not make sense in that context. On the other hand, if
compute_pca is FALSE, we also force the metric to be
modified_sse since the other metrics rely on eigenvalues
derived from PCA. In either case, find_best_params
ultimately calls run_knn_classification to continue with
the grid search.
find_best_params <- function(M, image_ids, metrics, param_grid, num_folds = 5,
num_people_to_exclude = 1, compute_pca = TRUE, apply_fda = FALSE) {
# Create random splits based on image IDs
folds <- create_random_splits_by_image(image_ids, num_splits = num_folds,
num_people_to_exclude = num_people_to_exclude)
results <- list()
for (fold in seq_along(folds)) {
test_indices <- folds[[fold]]$test
train_indices <- folds[[fold]]$train
cat("Fold:", fold, "\n")
if (compute_pca || apply_fda) {
cat("Computing PCA...\n")
pca_results <- compute_pca(M[train_indices, ])
mean_obs <- pca_results$mean_obs
P <- pca_results$P
E <- pca_results$E
cat("PCA computed. Projecting data...\n")
# Project data using PCA (all available PCA components)
Y <- sweep(M, 2, mean_obs, FUN = "-") %*% P
} else {
cat("Skipping PCA. Using original data...\n")
Y <- M # Use original data
}
cat("dim(Y):", dim(Y), "\n")
# If PCA is not computed, restrict metrics to 'modified_sse'
if (!compute_pca || apply_fda == TRUE) metrics <- c("modified_sse")
# Process classification using the appropriate representation
if (compute_pca || apply_fda) {
for (n_pc in param_grid$n_pcs) {
# Use the first n_pc PCA components
Y_pc <- Y[, 1:n_pc, drop = FALSE]
if (apply_fda) {
cat("Applying FDA on PCA matrix with", n_pc, "components...\n")
# Append class labels to Y_pc and compute FDA on these n_pc components
Y_pc_with_class <- cbind(Y_pc, class = image_ids)
fda_results <- fda(Y_pc_with_class)
# Project Y_pc into FDA space
FDA_proj <- Y_pc %*% fda_results$P_fda
# Loop over the FDA dimensions specified in the grid
for (n_fda in param_grid$fda_dims) {
# Skip FDA dimension if it exceeds the number of PCA components used
if (n_fda > n_pc) next
FDA_dims <- n_fda # Use the loop variable directly.
Y_used <- FDA_proj[, 1:FDA_dims, drop = FALSE]
E_used <- fda_results$E_fda[1:FDA_dims, drop = FALSE]
cat("Using", n_pc, "PCA components and", FDA_dims, "FDA components.\n")
Y_train <- Y_used[train_indices, , drop = FALSE]
Y_test <- Y_used[test_indices, , drop = FALSE]
image_ids_train <- image_ids[train_indices]
image_ids_test <- image_ids[test_indices]
results <- run_knn_classification(Y_train, Y_test, image_ids_train, image_ids_test,
metrics, param_grid, fold, n_pc, E_used, FDA_dims, results)
}
} else {
# If only PCA is computed, use the PCA subspace directly
Y_used <- Y_pc
E_used <- E[1:n_pc, drop = FALSE]
n_used <- n_pc
Y_train <- Y_used[train_indices, , drop = FALSE]
Y_test <- Y_used[test_indices, , drop = FALSE]
image_ids_train <- image_ids[train_indices]
image_ids_test <- image_ids[test_indices]
results <- run_knn_classification(Y_train, Y_test, image_ids_train, image_ids_test,
metrics, param_grid, fold, n_pc, E_used, FDA_dims = NA, results)
}
}
} else {
# If no PCA is computed, use the original data directly.
Y_train <- Y[train_indices, , drop = FALSE]
Y_test <- Y[test_indices, , drop = FALSE]
image_ids_train <- image_ids[train_indices]
image_ids_test <- image_ids[test_indices]
results <- run_knn_classification(Y_train, Y_test, image_ids_train, image_ids_test,
metrics, param_grid, fold, n_pc = NA, E_pc = NULL, FDA_dims = NA, results)
}
}
return(results)
}
The function get_best_params is used to handle the grid
search for optimal classifier parameters, either for finding the best
threshold for each distance metric or for determining the best values of
\(k\) and n_pc (and, if
applicable, FDA dimensions) by fold. This function accepts several
parameters:
- best_threshold: Boolean indicating whether we are looking for the best threshold or not.
- M: Matrix with the raw RGB image representation, one image per row.
- image_ids: IDs of the images in the dataset.
- param_grid: Grid of parameters to try.
- num_folds: Number of train/test splits to generate.
- num_people_to_exclude: Number of people whose images will be excluded from the train set and forced into the test set.
- compute_pca: Boolean indicating whether we compute the PCA and project the images, or just use the raw image representation.
- apply_fda: Boolean indicating whether to apply FDA on the PCA projection.

If best_threshold is TRUE, the function returns a data
structure that includes a data frame containing the best threshold per
fold and per metric, another data frame with the average best threshold
by metric, and a simpler, easily accessible vector of the best
thresholds per metric. In this mode, the function identifies, for each
fold and each metric, the threshold that yields the highest balanced
accuracy, and then aggregates these results across folds.
On the other hand, if best_threshold is FALSE, the
function focuses on finding the optimal \(k\) (number of neighbors) and
n_pc (number of principal components) by fold. It returns a
data frame that contains the best \(k\)
and n_pc values for each fold, a summary data frame that
provides the average best \(k\),
average best n_pc, and average accuracy scores per metric,
and additional information about which metric was the best (i.e., the
one that performed best in the most folds) along with the average k and
n_pc for that metric. Importantly, when
apply_fda is set to TRUE, the function adds the same
information about FDA dimensions as the information gathered for
n_pc.
get_best_params <- function(best_threshold = TRUE, M, image_ids, metrics, param_grid,
num_folds = 5, num_people_to_exclude = 1,
compute_pca = TRUE, apply_fda = FALSE) {
# Compute the parameters using either PCA only or PCA followed by FDA (if apply_fda is TRUE)
results <- find_best_params(M = M, image_ids = image_ids, metrics = metrics,
param_grid = param_grid, num_folds = num_folds,
num_people_to_exclude = num_people_to_exclude,
compute_pca = compute_pca, apply_fda = apply_fda)
if (best_threshold) {
# Step 1: Get the best threshold per fold and metric
best_thresholds_per_fold <- results %>%
group_by(Fold, Metric) %>%
slice_max(Balanced_Accuracy, n = 1, with_ties = FALSE) %>% # Select best threshold per fold and metric
select(Fold, Metric, Threshold, Balanced_Accuracy) %>%
ungroup()
# Step 2: Compute the average threshold per metric
average_best_thresholds <- best_thresholds_per_fold %>%
group_by(Metric) %>%
summarise(Average_Threshold = mean(Threshold, na.rm = TRUE)) %>%
ungroup()
# Convert to a named vector for easy access
best_thresholds_vector <- average_best_thresholds %>%
pull(Average_Threshold, name = Metric)
return(list(best_thresholds_per_fold = best_thresholds_per_fold,
average_best_thresholds = average_best_thresholds,
best_thresholds = best_thresholds_vector))
} else {
# Step 1: Get the best row per fold based on Balanced Accuracy
best_per_fold <- results %>%
group_by(Fold) %>%
slice_max(Balanced_Accuracy, with_ties = FALSE) %>% # Select row with highest Balanced Accuracy per fold
ungroup()
print(best_per_fold)
# Step 2: Compute the average k, n_pcs, and (if applicable) FDA_dims by Metric
average_k_npcs_by_metric <- best_per_fold %>%
group_by(Metric) %>%
summarise(
Occurrences = n(),
Average_k = round(mean(k, na.rm = TRUE)),
Average_n_pcs = round(mean(n_pcs, na.rm = TRUE)),
Average_Known_Accuracy = mean(Known_Accuracy, na.rm = TRUE),
Average_Unknown_Accuracy = mean(Unknown_Accuracy, na.rm = TRUE),
Average_Balanced_Accuracy = mean(Balanced_Accuracy, na.rm = TRUE),
Average_fda_dims = if (apply_fda) round(mean(FDA_dims, na.rm = TRUE)) else NA
) %>%
ungroup()
print(average_k_npcs_by_metric)
# Step 3: Find the best metric (here we choose the metric with the highest occurrence count)
best_metric <- average_k_npcs_by_metric %>%
slice_max(Occurrences, with_ties = FALSE) %>%
pull(Metric)
# Step 4: Compute the average k and n_pcs ONLY for the best metric
final_avg_k <- average_k_npcs_by_metric %>%
filter(Metric == best_metric) %>%
pull(Average_k) %>%
as.integer()
cat("final_avg_k: ", final_avg_k, "\n")
final_avg_n_pcs <- average_k_npcs_by_metric %>%
filter(Metric == best_metric) %>%
pull(Average_n_pcs) %>%
as.integer()
cat("final_avg_n_pcs: ", final_avg_n_pcs, "\n")
final_avg_fda_dims <- if (apply_fda) {
average_k_npcs_by_metric %>%
filter(Metric == best_metric) %>%
pull(Average_fda_dims) %>%
as.integer()
} else {
NA
}
cat("final_avg_fda_dims: ", final_avg_fda_dims, "\n")
return(list(best_per_fold = best_per_fold,
average_k_npcs_by_metric = average_k_npcs_by_metric,
best_metric = best_metric,
final_avg_k = final_avg_k,
final_avg_n_pcs = final_avg_n_pcs,
final_avg_fda_dims = final_avg_fda_dims))
}
}
As explained before, the rest of the parameters are fixed at this stage. The ranges of thresholds for each metric were selected by manually computing distances between images of the same person and between images of different people. We took the minimum and maximum distances found across several trials and chose ranges that contain those extremes.
print("You will not see the text output.")
## [1] "You will not see the text output."
if(COMPUTE_GRID_SEARCH){
cat("Finding best threshold...\n")
param_grid_thresholds <- list(
n_pcs = c(30), # Fixed n_pcs
k_values = c(5), # Fixed k
modified_sse = seq(3e-5, 2.5e-4, by = 5e-6), # Threshold
mahalanobis = seq(-6000, -1500, by = 100),# Threshold
weighted_angle = seq(-0.07, -0.009, by = 0.005)# Threshold
)
best_thresholds_pca_results <- get_best_params(best_threshold=TRUE, M,image_ids=image_ids, metrics=c('modified_sse', 'mahalanobis', 'weighted_angle'), param_grid=param_grid_thresholds, num_folds = 15, num_people_to_exclude=2, compute_pca=TRUE)
saveRDS(best_thresholds_pca_results, "best_thresholds_pca_results.rds")
}
best_thresholds_pca_results <- readRDS("best_thresholds_pca_results.rds")
kable(best_thresholds_pca_results$best_thresholds_per_fold)
| Fold | Metric | Threshold | Balanced_Accuracy |
|---|---|---|---|
| 1 | mahalanobis | -3.60e+03 | 0.6707317 |
| 1 | modified_sse | 3.50e-05 | 0.8170732 |
| 1 | weighted_angle | -6.50e-02 | 0.6849593 |
| 2 | mahalanobis | -2.90e+03 | 0.7439024 |
| 2 | modified_sse | 1.15e-04 | 0.8780488 |
| 2 | weighted_angle | -5.00e-02 | 0.9024390 |
| 3 | mahalanobis | -3.50e+03 | 0.7073171 |
| 3 | modified_sse | 8.50e-05 | 0.9146341 |
| 3 | weighted_angle | -5.00e-02 | 0.9634146 |
| 4 | mahalanobis | -3.20e+03 | 0.7073171 |
| 4 | modified_sse | 1.20e-04 | 0.8780488 |
| 4 | weighted_angle | -4.50e-02 | 0.9268293 |
| 5 | mahalanobis | -3.30e+03 | 0.6585366 |
| 5 | modified_sse | 1.00e-04 | 0.8658537 |
| 5 | weighted_angle | -4.50e-02 | 0.9512195 |
| 6 | mahalanobis | -3.50e+03 | 0.6585366 |
| 6 | modified_sse | 1.05e-04 | 0.8170732 |
| 6 | weighted_angle | -5.50e-02 | 0.8069106 |
| 7 | mahalanobis | -3.10e+03 | 0.7439024 |
| 7 | modified_sse | 8.00e-05 | 0.8414634 |
| 7 | weighted_angle | -5.50e-02 | 0.8902439 |
| 8 | mahalanobis | -2.70e+03 | 0.8048780 |
| 8 | modified_sse | 6.50e-05 | 0.8658537 |
| 8 | weighted_angle | -5.00e-02 | 0.8902439 |
| 9 | mahalanobis | -2.90e+03 | 0.7804878 |
| 9 | modified_sse | 1.10e-04 | 0.9024390 |
| 9 | weighted_angle | -5.00e-02 | 0.9268293 |
| 10 | mahalanobis | -3.00e+03 | 0.7804878 |
| 10 | modified_sse | 1.30e-04 | 0.9024390 |
| 10 | weighted_angle | -6.00e-02 | 0.8658537 |
| 11 | mahalanobis | -2.60e+03 | 0.8428571 |
| 11 | modified_sse | 8.50e-05 | 0.9000000 |
| 11 | weighted_angle | -6.50e-02 | 0.7571429 |
| 12 | mahalanobis | -2.60e+03 | 0.8048780 |
| 12 | modified_sse | 1.30e-04 | 0.8536585 |
| 12 | weighted_angle | -4.50e-02 | 0.8902439 |
| 13 | mahalanobis | -3.10e+03 | 0.7682927 |
| 13 | modified_sse | 9.00e-05 | 0.9146341 |
| 13 | weighted_angle | -6.50e-02 | 0.8363821 |
| 14 | mahalanobis | -2.50e+03 | 0.8857143 |
| 14 | modified_sse | 1.00e-04 | 0.9428571 |
| 14 | weighted_angle | -5.50e-02 | 0.9428571 |
| 15 | mahalanobis | -3.70e+03 | 0.7073171 |
| 15 | modified_sse | 5.00e-05 | 0.9146341 |
| 15 | weighted_angle | -5.50e-02 | 0.9512195 |
kable(best_thresholds_pca_results$average_best_thresholds)
| Metric | Average_Threshold |
|---|---|
| mahalanobis | -3.08e+03 |
| modified_sse | 9.33e-05 |
| weighted_angle | -5.40e-02 |
best_thresholds_pca_results$best_thresholds
## mahalanobis modified_sse weighted_angle
## -3.080000e+03 9.333333e-05 -5.400000e-02
In the previous step, we found the best threshold for each metric, so those values are now fixed. We will iterate over the first 30 components; in none of the trials we ran during implementation did we find better results above that number, so to save computation time we limit the number of components to try. The same argument applies to the decision of trying \(k\) only up to 10.
if(COMPUTE_GRID_SEARCH){
cat("Finding best params...\n")
best_thresholds_pca_results <- readRDS("best_thresholds_pca_results.rds")
best_thresholds_pca <- best_thresholds_pca_results$best_thresholds
param_grid_others <- list(
n_pcs = seq(10, 30, by=1),
#n_pcs = c(4),
k_values = seq(1, 10, by=1),
#k_values = c(1),
modified_sse = c(best_thresholds_pca["modified_sse"]),
mahalanobis = c(best_thresholds_pca["mahalanobis"]),
weighted_angle = c(best_thresholds_pca["weighted_angle"])
)
best_params_pca <- get_best_params(best_threshold=FALSE, M,image_ids=image_ids, metrics=c('modified_sse', 'mahalanobis', 'weighted_angle'), param_grid=param_grid_others, num_folds = 15, num_people_to_exclude=2, compute_pca=TRUE)
saveRDS(best_params_pca, "best_params_pca.rds")
}
best_params_pca <- readRDS("best_params_pca.rds")
kable(best_params_pca$best_per_fold)
| Fold | n_pcs | Metric | k | Threshold | Known_Accuracy | Unknown_Accuracy | Balanced_Accuracy | FDA_dims |
|---|---|---|---|---|---|---|---|---|
| 1 | 16 | modified_sse | 1 | 9.33e-05 | 0.9024390 | 1 | 0.9512195 | NA |
| 2 | 12 | modified_sse | 1 | 9.33e-05 | 0.9756098 | 1 | 0.9878049 | NA |
| 3 | 21 | weighted_angle | 1 | -5.40e-02 | 0.9756098 | 1 | 0.9878049 | NA |
| 4 | 24 | weighted_angle | 1 | -5.40e-02 | 0.9512195 | 1 | 0.9756098 | NA |
| 5 | 25 | weighted_angle | 1 | -5.40e-02 | 0.9512195 | 1 | 0.9756098 | NA |
| 6 | 15 | modified_sse | 1 | 9.33e-05 | 0.9268293 | 1 | 0.9634146 | NA |
| 7 | 26 | weighted_angle | 1 | -5.40e-02 | 0.9756098 | 1 | 0.9878049 | NA |
| 8 | 18 | modified_sse | 1 | 9.33e-05 | 0.9512195 | 1 | 0.9756098 | NA |
| 9 | 22 | modified_sse | 1 | 9.33e-05 | 0.9512195 | 1 | 0.9756098 | NA |
| 10 | 14 | modified_sse | 1 | 9.33e-05 | 0.9512195 | 1 | 0.9756098 | NA |
| 11 | 18 | modified_sse | 1 | 9.33e-05 | 1.0000000 | 1 | 1.0000000 | NA |
| 12 | 24 | weighted_angle | 3 | -5.40e-02 | 0.9512195 | 1 | 0.9756098 | NA |
| 13 | 26 | weighted_angle | 3 | -5.40e-02 | 1.0000000 | 1 | 1.0000000 | NA |
| 14 | 29 | weighted_angle | 1 | -5.40e-02 | 0.9428571 | 1 | 0.9714286 | NA |
| 15 | 26 | weighted_angle | 1 | -5.40e-02 | 0.9512195 | 1 | 0.9756098 | NA |
kable(best_params_pca$average_k_npcs_by_metric)
| Metric | Occurrences | Average_k | Average_n_pcs | Average_Known_Accuracy | Average_Unknown_Accuracy | Average_Balanced_Accuracy | Average_fda_dims |
|---|---|---|---|---|---|---|---|
| modified_sse | 7 | 1 | 16 | 0.9512195 | 1 | 0.9756098 | NA |
| weighted_angle | 8 | 2 | 25 | 0.9623693 | 1 | 0.9811847 | NA |
best_params_pca$best_metric
## [1] "weighted_angle"
best_params_pca$final_avg_k
## [1] 2
best_params_pca$final_avg_n_pcs
## [1] 25
The same procedure as in PCA-KNN is applied, this time on the raw image representation (no PCA projection).
if(COMPUTE_GRID_SEARCH){
cat("Finding best parameters...\n")
param_grid_thresholds <- list(
n_pcs = c(30), # Fixed n_pcs
k_values = c(5), # Fixed k
modified_sse = seq(from = 2.0e-07, to = 1.1e-05, length.out = 30), # Threshold
mahalanobis = seq(-6000, -1500, by = 100),
weighted_angle = seq(-0.07, -0.009, by = 0.005)
)
best_thresholds_no_pca_results <- get_best_params(best_threshold=TRUE, M,image_ids=image_ids, metrics=c('modified_sse', 'mahalanobis', 'weighted_angle'), param_grid=param_grid_thresholds, num_folds = 5, num_people_to_exclude=2, compute_pca=FALSE)
saveRDS(best_thresholds_no_pca_results, "best_thresholds_no_pca_results.rds")
}
best_thresholds_no_pca_results <- readRDS("best_thresholds_no_pca_results.rds")
kable(best_thresholds_no_pca_results$best_thresholds_per_fold)
| Fold | Metric | Threshold | Balanced_Accuracy |
|---|---|---|---|
| 1 | modified_sse | 6.5e-06 | 0.7682927 |
| 2 | modified_sse | 8.4e-06 | 0.9024390 |
| 3 | modified_sse | 4.7e-06 | 0.9024390 |
| 4 | modified_sse | 5.4e-06 | 0.8780488 |
| 5 | modified_sse | 5.4e-06 | 0.8536585 |
kable(best_thresholds_no_pca_results$average_best_thresholds)
| Metric | Average_Threshold |
|---|---|
| modified_sse | 6.1e-06 |
best_thresholds_no_pca_results$best_thresholds
## modified_sse
## 6.084138e-06
if(COMPUTE_GRID_SEARCH){
cat("Finding best parameters...\n")
best_thresholds_no_pca_results <- readRDS("best_thresholds_no_pca_results.rds")
best_thresholds_no_pca <- best_thresholds_no_pca_results$best_thresholds
param_grid_others <- list(
n_pcs = seq(10, 30, by=1),
#n_pcs = c(4),
k_values = seq(1, 10, by=1),
#k_values = c(1),
modified_sse = c(best_thresholds_no_pca["modified_sse"]),
mahalanobis = c(best_thresholds_no_pca["mahalanobis"]),
weighted_angle = c(best_thresholds_no_pca["weighted_angle"])
)
best_params_no_pca <- get_best_params(best_threshold=FALSE, M,image_ids=image_ids, metrics=c('modified_sse', 'mahalanobis', 'weighted_angle'), param_grid=param_grid_others, num_folds = 5, num_people_to_exclude=2, compute_pca=FALSE)
saveRDS(best_params_no_pca, "best_params_no_pca.rds")
}
best_params_no_pca <- readRDS("best_params_no_pca.rds")
kable(best_params_no_pca$best_per_fold)
| Fold | n_pcs | Metric | k | Threshold | Known_Accuracy | Unknown_Accuracy | Balanced_Accuracy | FDA_dims |
|---|---|---|---|---|---|---|---|---|
| 1 | NA | modified_sse | 1 | 6.1e-06 | 0.8292683 | 1 | 0.9146341 | NA |
| 2 | NA | modified_sse | 1 | 6.1e-06 | 0.8780488 | 1 | 0.9390244 | NA |
| 3 | NA | modified_sse | 1 | 6.1e-06 | 0.9512195 | 1 | 0.9756098 | NA |
| 4 | NA | modified_sse | 1 | 6.1e-06 | 0.8780488 | 1 | 0.9390244 | NA |
| 5 | NA | modified_sse | 1 | 6.1e-06 | 0.9024390 | 1 | 0.9512195 | NA |
kable(best_params_no_pca$average_k_npcs_by_metric)
| Metric | Occurrences | Average_k | Average_n_pcs | Average_Known_Accuracy | Average_Unknown_Accuracy | Average_Balanced_Accuracy | Average_fda_dims |
|---|---|---|---|---|---|---|---|
| modified_sse | 5 | 1 | NaN | 0.8878049 | 1 | 0.9439024 | NA |
best_params_no_pca$best_metric
## [1] "modified_sse"
best_params_no_pca$final_avg_k
## [1] 1
The same procedure as in PCA-KNN is applied. In this case, we
also fix fda_dims to save computation time. The chosen
fixed value is 24 because we have 25 classes, and FDA yields at most
\(C - 1 = 24\) meaningful discriminant directions; setting a higher
value would only add noise.
if(COMPUTE_GRID_SEARCH){
cat("Finding best threshold...\n")
param_grid_thresholds <- list(
n_pcs = c(30), # Fixed n_pcs
k_values = c(5), # Fixed k
fda_dims = c(24), #seq(10,24, by = 1)
modified_sse = seq(3e-5, 2.5e-4, by = 5e-6), # Threshold
mahalanobis = seq(-6000, -1500, by = 100),# Not used
weighted_angle = seq(-0.07, -0.009, by = 0.005)# Not used
)
best_thresholds_fda_results <- get_best_params(best_threshold=TRUE, M,image_ids=image_ids, metrics=c('modified_sse', 'mahalanobis', 'weighted_angle'), param_grid=param_grid_thresholds, num_folds = 15, num_people_to_exclude=2, compute_pca=TRUE, apply_fda = TRUE)
saveRDS(best_thresholds_fda_results, "best_thresholds_fda_results.rds")
}
best_thresholds_fda_results <- readRDS("best_thresholds_fda_results.rds")
kable(best_thresholds_fda_results$best_thresholds_per_fold)
| Fold | Metric | Threshold | Balanced_Accuracy |
|---|---|---|---|
| 1 | modified_sse | 0.000065 | 0.8170732 |
| 2 | modified_sse | 0.000120 | 0.9146341 |
| 3 | modified_sse | 0.000180 | 0.9390244 |
| 4 | modified_sse | 0.000200 | 0.8902439 |
| 5 | modified_sse | 0.000190 | 0.8658537 |
| 6 | modified_sse | 0.000250 | 0.8536585 |
| 7 | modified_sse | 0.000215 | 0.8780488 |
| 8 | modified_sse | 0.000110 | 0.9146341 |
| 9 | modified_sse | 0.000080 | 0.8902439 |
| 10 | modified_sse | 0.000235 | 0.9024390 |
| 11 | modified_sse | 0.000130 | 0.8714286 |
| 12 | modified_sse | 0.000155 | 0.8292683 |
| 13 | modified_sse | 0.000140 | 0.8902439 |
| 14 | modified_sse | 0.000225 | 0.9571429 |
| 15 | modified_sse | 0.000245 | 0.9390244 |
kable(best_thresholds_fda_results$average_best_thresholds)
| Metric | Average_Threshold |
|---|---|
| modified_sse | 0.0001693 |
best_thresholds_fda_results$best_thresholds
## modified_sse
## 0.0001693333
In this next grid search, the values of fda_dims
start at 10 rather than at a lower value to save some computation
time. Likewise, n_pcs is capped at 24 (instead of 30) to
reduce computation time; if desired, the grid can be widened to obtain
an even more precise result. Ideally we would evaluate every
combination in a single grid, but that is too computationally
expensive, which is why the search is split in two steps.
if(COMPUTE_GRID_SEARCH){
cat("Finding best params...\n")
best_thresholds_fda_results <- readRDS("best_thresholds_fda_results.rds")
best_thresholds_fda <- best_thresholds_fda_results$best_thresholds
param_grid_others <- list(
n_pcs = seq(10, 24, by=1),
k_values = seq(1, 10, by=1),
fda_dims = seq(10,24, by = 1),
modified_sse = c(best_thresholds_fda["modified_sse"]),
mahalanobis = c(best_thresholds_fda["mahalanobis"]),
weighted_angle = c(best_thresholds_fda["weighted_angle"])
)
best_params_fda <- get_best_params(best_threshold=FALSE, M,image_ids=image_ids, metrics=c('modified_sse', 'mahalanobis', 'weighted_angle'), param_grid=param_grid_others, num_folds = 15, num_people_to_exclude=2, compute_pca=TRUE, apply_fda = TRUE)
saveRDS(best_params_fda, "best_params_fda.rds")
}
best_params_fda <- readRDS("best_params_fda.rds")
kable(best_params_fda$best_per_fold)
| Fold | n_pcs | Metric | k | Threshold | Known_Accuracy | Unknown_Accuracy | Balanced_Accuracy | FDA_dims |
|---|---|---|---|---|---|---|---|---|
| 1 | 23 | modified_sse | 1 | 0.0001693 | 0.9756098 | 1 | 0.9878049 | 12 |
| 2 | 14 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 10 |
| 3 | 17 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 12 |
| 4 | 22 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 10 |
| 5 | 19 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 14 |
| 6 | 22 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 16 |
| 7 | 23 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 10 |
| 8 | 15 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 10 |
| 9 | 20 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 10 |
| 10 | 22 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 12 |
| 11 | 13 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 12 |
| 12 | 23 | modified_sse | 1 | 0.0001693 | 0.9512195 | 1 | 0.9756098 | 10 |
| 13 | 23 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 13 |
| 14 | 22 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 10 |
| 15 | 20 | modified_sse | 1 | 0.0001693 | 1.0000000 | 1 | 1.0000000 | 10 |
kable(best_params_fda$average_k_npcs_by_metric)
| Metric | Occurrences | Average_k | Average_n_pcs | Average_Known_Accuracy | Average_Unknown_Accuracy | Average_Balanced_Accuracy | Average_fda_dims |
|---|---|---|---|---|---|---|---|
| modified_sse | 15 | 1 | 20 | 0.995122 | 1 | 0.997561 | 11 |
best_params_fda$final_avg_k
## [1] 1
best_params_fda$final_avg_n_pcs
## [1] 20
best_params_fda$final_avg_fda_dims
## [1] 11
The results show that applying PCA before KNN improves balanced accuracy (0.981 vs. 0.943) while significantly reducing computation time. Although we expected a larger performance gap, the similarity in results suggests that KNN can still work well in high-dimensional spaces when properly tuned. Notice that the difference in balanced accuracy is not that big, but the gap widens if we look at the known accuracy, suggesting that the model without PCA has more trouble correctly classifying known faces. The PCA approach retains enough variance to achieve strong recognition while making training and inference more efficient.
We should choose the PCA-based KNN model because it achieves better accuracy while drastically reducing computation time and memory usage. It is more scalable, less prone to overfitting, and provides a more efficient representation of facial features without sacrificing recognition performance.
The grid search results under the FDA approach indicate that the optimal configuration to use is \(k\) = 1, 20 principal components, and 11 FDA dimensions. Specifically, the summary data reveals an average known accuracy of 0.995, an average unknown accuracy of 1, and an average balanced accuracy of 0.998. This demonstrates that applying FDA on top of PCA further refines our model’s ability to distinguish between classes, leading to near-perfect recognition performance. Overall, these findings confirm that the PCA-based KNN model, especially when enhanced with FDA, improves recognition accuracy—particularly for known faces.